Digitization Issues
Report: Working Group on Digitization Issues
Richard Sherwood, Richard Mahoney, and Helen Aristar-Dry (chair)
Digital preservation of scientific data is a relatively new enterprise; but as early as 2001 plans were underway to create distributed digital archives of anthropological material (Clark et al., 2001). Various types of anthropological material lend themselves to preservation in a digital archive. The working group identified at least six types. Examples of these are provided in Table 1.
Type      | Examples
Images    | Photographs, maps of excavation sites, biomedical images (MRIs, radiographs)
Texts     | Field notes, annotations, excavation plans
Audio     | Recordings of songs, conversations, oral histories
Video     | Recordings of cultural events, conversations, archaeological excavations
Databases | Database of skull measurements, lexical items
3-D scans | Scan of fossil or artifact

Table 1: Anthropological data
To preserve such data for long term use, researchers must ensure long term ‘intelligibility’ in both human and computational terms. ‘Human intelligibility,’ of course, refers to the ability of future researchers to understand the information; this is too often compromised by the lack of documentation accompanying the digital file. ‘Computational intelligibility’ refers to the ability of future hardware and software to interpret the file format; and this can be compromised by the pace of technological change. Since the 1996 report of the Task Force on Digital Archiving (Garrett and Waters, 1996), it has been commonplace to remark on the ‘digital dark age’ threatened by the rapid obsolescence of physical recording media and the equally rapid obsolescence of operating systems and file formats. Simons (2006) noted that physical media have declined in durability over the years, contrasting the long term legibility of inscriptions in stone with the many different types of storage media in use in the past 25 years (5.25” floppies, 3.5” floppies, Zip drives, memory sticks, CDs, DVDs, Blu-ray discs). The obsolescence of operating systems and file formats is even more striking: current versions of MS Word cannot read documents created in Word 1.0.
To address the threat to human intelligibility, researchers are advised to: (a) write metadata and keep it with the file, (b) use standardized vocabularies and abbreviations wherever possible, (c) document any idiosyncratic annotation or abbreviations, and (d) use Unicode for character encoding or, at the least, document any special characters used to represent international alphabets.
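One simple way to keep metadata ‘with the file’ is a plain-text ‘sidecar’ file stored beside the object it describes. The sketch below is illustrative only: the file names and metadata fields are hypothetical, and real projects should prefer an established metadata schema (e.g., Dublin Core or OLAC) where one exists.

```python
import json
from pathlib import Path

def write_metadata_sidecar(data_file, metadata):
    """Write a UTF-8 JSON 'sidecar' file next to the data file it describes."""
    sidecar = Path(data_file).with_suffix(".metadata.json")
    # ensure_ascii=False keeps non-ASCII characters as real Unicode text
    sidecar.write_text(json.dumps(metadata, indent=2, ensure_ascii=False),
                       encoding="utf-8")
    return sidecar

# Hypothetical field recording with hypothetical metadata fields
write_metadata_sidecar("interview_042.wav", {
    "title": "Oral history interview 42",
    "creator": "J. Researcher",
    "date": "2009-06-15",
    "language": "Quechua",
    "notes": "Abbreviation 'sp.' in the transcript marks Spanish loanwords",
})
```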
To address the threat of technological obsolescence, Simons (2006) recommends that researchers create an archival master in an enduring file format and deposit the archival master in a preservation archive. A preservation archive is an established institution committed to long term preservation of the digital object; a distinguishing characteristic is that a preservation archive will have a technology migration plan on which to found its claims of long term digital accessibility. It thus contrasts with a ‘web archive,’ which is often only a website serving information from a database or file directory. Web archives rarely serve genuinely interoperable material, and they regularly disappear in response to changes in institutional servers or in the responsibilities of the archive creator.
What is an ‘enduring file format’? In the acronym coined by Simons, it is a format that offers LOTS. In other words, it is Lossless, Open, Transparent, and Supported by multiple vendors. Each of these desiderata deserves some discussion.
Lossless: A lossless file format is one in which no information is lost through file compression. It is uncontroversial to say, for example, that an archival master should be uncompressed and unedited (ARSC Technical Committee, 2009). However, copies may, of course, be made from the archival file, and these can be altered to serve as working or presentation copies. Professional archivists usually recommend that the archival master be copied once, to make a ‘presentation master,’ and that compressed and edited copies be made from the presentation master, not the archival master. Although digital copying does not harm the original file if done correctly, use of a presentation master is probably good advice: some media programs compress automatically when they save a file, and to find this out too late is to irrevocably lose part of the information on the archival master.
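Copying ‘correctly’ can be confirmed by comparing checksums of the archival master and the new copy. The following is a minimal sketch using Python’s standard library; the file names are hypothetical.

```python
import hashlib
import shutil

def copy_and_verify(master, copy_path, chunk_size=1 << 20):
    """Copy an archival master, then verify the copy bit-for-bit via SHA-256."""
    shutil.copyfile(master, copy_path)
    digests = []
    for path in (master, copy_path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in chunks so very large media files never fill memory
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        digests.append(h.hexdigest())
    if digests[0] != digests[1]:
        raise IOError("Copy does not match archival master: %s" % copy_path)
    return digests[0]

# Hypothetical usage: derive a presentation master from the archival master
copy_and_verify("ceremony_master.wav", "ceremony_presentation.wav")
```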
Although uncompressed file formats are preferable even to those with lossless compression, lossless compression is an option if uncompressed files are so large (e.g., video) that their storage is impractical. Lossless compression algorithms typically remove only redundant information (e.g., pixels of the same color in an image) and allow the full content to be recovered through the use of a decoding algorithm. ‘Lossy’ compression, on the other hand, means that the so-called ‘irrelevant’ information can never be recovered; thus it is to be avoided for highly valued material. Although the difference between a compressed file and an uncompressed file may be indistinguishable to human ears and eyes, in creating a scientific archive of irreplaceable material (e.g., songs and ceremonies of a vanishing culture), we should remember that the scientific instruments of the future may be able to extract more information from the ‘noise’ in an uncompressed file than we are currently able to perceive.
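The defining property of lossless compression (exact recovery of every bit through a decoding algorithm) can be demonstrated with Python’s standard zlib module, used here purely as a stand-in for the lossless codecs listed in Table 2:

```python
import zlib

# Highly redundant data compresses well; the trailing text stands in for content
original = b"AAAA" * 1000 + b"field notes, 2009"

compressed = zlib.compress(original)    # lossless (DEFLATE) compression
restored = zlib.decompress(compressed)  # decoding recovers the original bits

assert restored == original             # every bit of content is recovered
print(len(original), "bytes ->", len(compressed), "bytes")
```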
Table 2 shows some common extensions of uncompressed file formats and formats employing lossless and lossy compression.
Type   | Uncompressed           | Compressed (Lossless)                          | Compressed (Lossy)
Audio  | .wav, .aiff, .au (PCM) | .ape, FLAC, TTA                                | .mp3, .aac, .wma
Images | .bmp, .tiff w/o LZW    | .tiff (or .tif) w/ LZW, .png, .gif (grayscale) | .jpg
Video  | .rtv                   | JPEG-2000                                      | MPEG-2, DV, MPEG-4
Text   | .txt                   | .zip                                           | NA

Table 2: File extensions of compressed and uncompressed formats (Aristar-Dry, 2008)
Open: Openness refers to the fact that some file format specifications are publicly available; for example, html, XML, pdf, and rtf are all ‘open standard.’ This means that any software engineer can develop programs that can read these file formats. By contrast, information in proprietary file formats will be lost when the vendor ceases to support the software. “Open standard” is different from “open source,” i.e., software whose source code is publicly available. Examples of open source software include OpenOffice and Mozilla Thunderbird. Open source software usually creates files in open standards, and proprietary software usually does not (though there are exceptions, e.g., Adobe PDF). But for long term intelligibility, open standards are more important than open source software. Table 3 below lists some common file extensions classified by standard type (open vs. proprietary) and by development model (open source vs. commercial).
Development | Open Standard                 | Proprietary Standard
Open source | .txt, .html, .xml, .odf, .csv | NA
Commercial  | .rtf, .pdf                    | .doc, .xls, .ppt

Table 3: Open and proprietary standards (Aristar-Dry, 2008)
Transparent: The file format requires no special knowledge or algorithm to interpret, because there is a one-to-one correspondence between the numerical values sent to the computer and the information they represent. Plain text, for example, has a one-to-one correspondence between the characters and the computer-readable binary numbers used to represent them. Similarly, the PCM (pulse code modulation) codec, which is employed by .wav, .aiff, and CDDA files, has a one-to-one correspondence between the numbers and the amplitudes of the sound wave. Thus plain text files (.txt) can be read by any software program that processes text, and PCM signals can be interpreted by virtually all audio programs. By contrast, .zip and .mp3 files require implementation of a complex algorithm to restore the original correspondences.
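To illustrate this transparency, the sketch below reads PCM samples from a hypothetical 16-bit mono .wav file using only Python’s standard library; each pair of bytes maps directly to one amplitude value, with no decoding algorithm beyond that one-to-one correspondence.

```python
import struct
import wave

# Open a hypothetical 16-bit mono PCM recording
with wave.open("interview_042.wav", "rb") as w:
    raw = w.readframes(w.getnframes())

# For 16-bit PCM, every two bytes are one signed amplitude value
amplitudes = struct.unpack("<%dh" % (len(raw) // 2), raw)
print("first ten samples:", amplitudes[:10])
```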
Today many programs provide automatic decoding of the common encoded formats. But we cannot be certain that these programs will not become obsolete. In the distant future, some of the encoding algorithms may be lost; and, at that point, interpreting compressed and opaque files will become a costly scientific endeavor.
Supported by multiple vendors: Just as lack of compression and transparency are paired in file formats, use of open standards and support by multiple vendors go together in software development. Open standards are more likely than proprietary standards to have wide vendor support, because development using open standards is typically less costly. If a file format is open, there is no inherent barrier to creating another program that handles it. It is not necessary to reverse engineer the format or purchase the specification from the developer. And the more software applications that handle a file format, the less likely that format is to fall victim to hardware and software obsolescence.
As noted above, these recommendations are intended to apply to the archival master, not to presentation copies or working copies. However, even with archival masters, some caveats are in order. Transparency, for example, is not possible with some advanced visualization techniques, e.g., 3-D scanning, CT (computed tomography), GIS. And sometimes the ideal is simply not achievable, either in format or equipment. For example, a laser scanner is recommended for x-rays; but these machines cost upwards of $20,000 and are often out of reach of small projects. Some archivists, therefore, speak not only of ‘best practices’ in digital preservation, but also of ‘good practices’ or even ‘pretty good practices’—i.e. practices that will suffice when the ideal is unattainable. They also emphasize that situation and type of data must always be taken into account.
For example, best practice is to record audio at a 24-bit depth and 96 kHz sampling rate, in stereo; and this is ideal for data which will be subjected to phonetic analysis. But 16-bit, 44.1 kHz may be adequate for an oral history (especially since playback machines for 24-bit/96 kHz audio are not widely available). Similarly, best practice is to scan images at 600 dpi. But 300 dpi may be preferred in some cases—for example, for x-rays, where scanning at a higher dpi would actually make the x-ray less intelligible because it exceeds the resolution of the image. And best practice for text is to output plain text annotated in XML (which captures content, not just formatting). But software to support XML writing and editing is not always available; in that case, good practice is to use any kind of structured data format (e.g., a spreadsheet or a word processing format with a stylesheet) and to provide metadata and explanatory annotations for the content.
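As a rough illustration of content-oriented XML markup for text data, the sketch below generates a small lexical entry. The element names and values are invented for the example and do not represent a recognized annotation standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexical entry; element names are invented for illustration
entry = ET.Element("entry", id="lex042")
ET.SubElement(entry, "headword").text = "wasi"
ET.SubElement(entry, "gloss", lang="en").text = "house"
ET.SubElement(entry, "note").text = "Recorded in interview 42; see sidecar metadata."

# Plain text plus XML markup: the output is readable in any text editor
ET.ElementTree(entry).write("lexicon_entry.xml", encoding="utf-8",
                            xml_declaration=True)
```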
Technical recommendations for digital preservation are, of course, a moving target. Technology changes so rapidly that regular consultation of up-to-date websites is recommended for all anthropologists interested in preparing their data for long term digital preservation. The bibliography at the end of this report lists some general resources which are worth investigating, as well as several specific to audio, image, and video standards. More such resources will no doubt become available as more domain experts become involved in adapting general recommendations for digital archiving to the goals and procedures of specific disciplines.
Works Cited
Aristar-Dry, Helen. 2008. “Preserving Digital Language Materials: Some Considerations for Community Initiatives.” In Language and Poverty, ed. Wayne Harbert, Sally McConnell-Ginet, and Amanda Lynn Miller, 202-222. Multilingual Matters.
ARSC Technical Committee. 2009. Preservation of Archival Sound Recordings, Version 1, April 2009. http://www.arsc-audio.org/pdf/ARSCTC_preservation.pdf
Clark, Jeffrey T., Brian M. Slator, Aaron Bergstrom, Francis Larson, Richard Frovarp, James E. Landrum III, and William Perrizo. 2001. “Preservation and Access of Cultural Heritage Objects through a Digital Archive Network for Anthropology.” In Proceedings of the Seventh International Conference on Virtual Systems and Multimedia (VSMM’01), 28.
Garrett, John, and Donald Waters. 1996. Preserving Digital Information: Report of the Task Force on Archiving of Digital Information, commissioned by the Commission on Preservation and Access and the Research Libraries Group. Washington, DC: Commission on Preservation and Access. http://www.rlg.org/ArchTF/tfadi.index.htm
Howard, Roger. 2003. Posting to the imagelib mailing list, 9 April 2003. http://eclipse.wustl.edu/~listmgr/imagelib/Apr2003/0011.html
Simons, Gary F. 2006. “Ensuring That Digital Data Last: The Priority of Archival Form over Working Form and Presentation Form.” Expanded version of a paper originally presented at the EMELD Symposium on “Endangered Data vs. Enduring Practice,” Linguistic Society of America annual meeting, 8-11 January 2004, Boston, MA. http://www.sil.org/silewp/2006/003/SILEWP2006-003.htm
Additional Resources on Digital Preservation
NARA (2004):
http://www.archives.gov/preservation/technical/guidelines.html
New Jersey Digital Highway Project (2007?):
http://www.njdigitalhighway.org/digitizing_collections_libr.php
NINCH (2002): http://www.ninch.org/guide.pdf
E-MELD School of Best Practices in Digital Language Documentation: http://emeld.org/school/
Additional Information on Audio
Sound Directions (2009):
http://www.dlib.indiana.edu/projects/sounddirections/papersPresent/index.shtml
CDP Digital Audio Working Group (2006):
http://www.bcr.org/cdp/best/digital-audio-bp.pdf
U. of Maryland Libraries (2007):
http://www.lib.umd.edu/dcr/publications/best_practice.pdf
Additional Information on Images and Video
Visual Arts Data Service (2000?):
http://vads.ahds.ac.uk/guides/creating_guide/sect31.html
California Digital Libraries (2008):
http://www.cdlib.org/inside/diglib/guidelines/bpgimages/
Washington State Library:
http://digitalwa.statelib.wa.gov/newsite/best.htm
BCR Digital Images Working Group (2008):
http://www.bcr.org/cdp/best/digital-imaging-bp.pdf
Notes
If the working copy is the primary copy—as, for example, during the ongoing creation of a database—it is important to export the information regularly into an enduring file format. For databases (which are usually managed by proprietary software), this means exporting the data regularly into properly documented plain text. A .txt file with informative XML markup is ideal, but often the XML automatically output by a program will be only minimally helpful to someone trying to make sense of the file. In that case, a file including metadata identifying the fields and tables should be created and stored with the database output.
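A minimal sketch of such a routine export follows, assuming a SQLite database and hypothetical table and file names; projects using other database software would substitute its own export facilities.

```python
import csv
import sqlite3

def export_table_to_csv(db_path, table, out_path):
    """Dump one table to UTF-8 CSV with a header row naming the fields."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute("SELECT * FROM %s" % table)  # table name trusted here
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cursor.description])  # field names
            writer.writerows(cursor)
    finally:
        conn.close()

# Hypothetical usage: export each table of a measurements database
export_table_to_csv("skulls.db", "measurements", "measurements.csv")
```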
For example, Acrobat 7.0 will automatically compress large pdf files (see: http://www.planetpdf.com/forumarchive/166948.asp). Most importantly, however, as of this writing, most video capture programs automatically compress the audio track along with the video when it is downloaded to a computer. For that reason, linguists and musicologists are advised to make a separate audio recording, using a device like a hand-clap at the beginning to aid in synchronizing the files later on. See: http://emeld.org/school/classroom/video/field.html#1006
As noted by a Senior Media Specialist at the Getty Museum, “Uncompressed data is trivial to decode, compressed data often is not. This makes for easier long-term viability of the file. . . .” Furthermore, uncompressed data is less prone to loss: “Lossless compression means that a single bit in the compressed file may represent multiple bits in the uncompressed version. This magnifies potential damage caused by bit corruption. In an uncompressed file a single flipped bit will have little overall impact on the renderability of an image. In a lossless compressed file depending on whether the corruption is in the dictionary (in the header) or in image data it can have a larger effect. And in a lossy compression scheme a single bit corrupted can be extremely noticeable” (Howard, 2003).
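Howard’s point about magnified damage can be simulated: flipping one bit in raw data changes a single value, while flipping one bit in a losslessly compressed stream can make the whole file undecodable. The sketch below uses zlib purely as a stand-in for a lossless codec.

```python
import zlib

data = bytes(range(256)) * 64          # stand-in for raw, uncompressed content
compressed = zlib.compress(data)

def flip_bit(buf, index):
    """Return a copy of buf with one bit flipped at the given byte index."""
    corrupted = bytearray(buf)
    corrupted[index] ^= 0x01
    return bytes(corrupted)

# In the uncompressed file, one flipped bit changes exactly one byte
damaged_raw = flip_bit(data, 100)
print(sum(a != b for a, b in zip(data, damaged_raw)), "byte(s) differ")

# In the compressed file, the same single-bit error can break decoding entirely
try:
    zlib.decompress(flip_bit(compressed, len(compressed) // 2))
except zlib.error as e:
    print("decompression failed:", e)
```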
Technically, .wav and .aiff are container formats: file structures that allow combining of audio/video data, tags, menus, subtitles, and other media elements. They could in theory contain compressed audio, but in practice they usually contain PCM (pulse code modulation) data, which is an uncompressed format.
Advanced Audio Coding (.aac) and Windows Media Audio (.wma) both have a lossless version. Confusingly, both the lossless and the lossy compression formats use the same file extension.